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r Abstract 

In this paper, we develop a new method to sample from posterior distributions in hierar- 
chical models without using Markov chain Monte Carlo. This method is generally applicable 
to high-dimensional models involving large data sets. Illustrative analysis exemplifies the 
ease with which one could implement our method, which results in independent samples from 
^ the posterior distributions of interest. 
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1 Introduction 



Markov chain Monte Carlo methods have been around for a long time. Since the seminal paper 
by Gelfand and Smith (1990), the Bayesian computational literature has exploded. It would be 
difficult to list the tens of excellent papers that deal with so many aspects of MCMC and other 
Monte Carlo methods. Hence, we cite three books, and the hundreds of references therein, that 
have helped statisticians over the years: Gelman et al. (2003), Liu (2001), and Chen et al. (2000). 
Brooks et al. (2010) includes more recent advances in MCMC methodology. The popularity 
of MCMC is due, in part, to its general applicability and theoretical properties. However, it is 
well-known that the main drawback of using MCMC is that dependent samples are generated 
if, as is commonly done, a single chain is used. Further, knowing how long to run the chain 
and other aspects, such as the burn-in length, are critical considerations. Indeed, there is a vast 
literature that addresses the difficulties and proposed remedies in diagnosing convergence of 
MCMC chains; again, see the above book references, and the citations therein. 

In this paper we introduce a new estimation algorithm, Generalized Direct Sampling (GDS), that 
generates independent samples from a target posterior distribution in parallel, unlike MCMC 
for which, in the absence of parallel independent chains, samples are collected sequentially. 
There is no need to concern oneself with issues like chain convergence and autocorrelation. 
GDS builds on the basic ideas of Direct Sampling (DS) by Walker et al. (2010). However, DS 
suffers from two serious shortcomings that limit its broad applicability: the failure to separate 
the specification of the prior from the specifics of the estimation algorithm; and an inability to 
generate accepted draws for even moderately sized problems (the largest number of parameters 
that Walker et al. consider is 10). Our interest is in conducting full Bayesian inference on complex 
hierarchical models, with or without conjugacy but with thousands of parameters, common in 
many disciplines, sans MCMC. One key inspiration for a non-MCMC approach to doing Bayesian 
inference for such hierarchical models is perfectly articulated by Papaspiliopoulous and Roberts 
(2008): "However, to date, there has been little theoretical analysis linking the stability of the Gibbs sampler 
to the structure of hierarchical models." 

In the GDS method, the only restrictions are that one must be able to compute the unnormalized 
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log posterior of the parameters (or a good approximation of it); that the posterior distribution 
must be bounded from above over the parameter space; and that one is able to locate any local 
maxima of the log posterior function using available computing resources. There are no condi- 
tions of conjugacy or even conditional conjugacy. Although we do not claim that GDS should be 
the preferred estimation method for all hierarchical models (and we will discuss the limitations), 
we find that these conditions are easily met for most parametric models, especially given the 
maturity of existing nonlinear optimization algorithms. 

We introduce GDS in Section |2j and describe its advantages relative to other estimation methods.. 
Section [3] includes several examples of GDS in action: a small, but important, two-parameter ex- 
ample for which MCMC is known to fail, but for which GDS works; a simulation study that 
compares the estimation of a simple probit regression model using MCMC and GDS; an appli- 
cation of GDS on a non-conjugate, hierarchical model with over 2, 000 parameters; and an online 
advertising application with nearly 30, 000 parameters. Finally, in Section |4j we discuss practi- 
cal issues that one should consider when implementing GDS, including some "lessons from the 
field" that we learned in preparing this paper, while also highlighting some aspects of GDS that 
require additional investigation. 



2 Generalized Direct Sampling: The Method 

GDS is a variant on well-known importance sampling methods. The goal is to sample 8 G Q from 
a posterior density n{6\y) <x f(y\8)n(8) = p(9), where sup n{9*\y) = f(y\6*)n(6*) = c < oo, y is 
some observed data, and 8* is the mode of the data likelihood. Just like in importance sampling, 
we consider a proposal density g(8), where we can write 

*(%) « }M ^m- m 

We will discuss the specifics of choosing g(8) later, but one desirable characteristic is that g{8) 
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and Tc{6\y) have the same mode at 0*. Define k = g{6*), and define 



In simulation, it is well-known that any density n(-) is equal to the probability that an auxiliary 
variable u ~ Uniform(0, 1) is less than n(-). Equivalently if we choose g{9) such that < O(0) < 
1 holds, then 7t(9\y) <x l(w < O(0)) when u is drawn from a uniform density on (0,0(0)). Of 
course, this does not imply that O(0) assumes only or 1. And so, we can write the posterior 
joint distribution of and u as 



p(9,u)<xl( U <<P(9))g(0), (3) 

where 1 denotes the indicator function. One can then refactor Equation[3]as p(0, u) = p(9\u)p(u). 

Clearly, the marginal densities of and u are p(6) = n{9\y) and p(u) = cty~ l Tl{A u ), respectively, 
where A u = {6 : O(0) > u} (the set of all values of for which O(0) > u holds), H(A) = 
j A 7z(6)d6 (the prior, integrated over some domain of values of 0, and xp = J f(y\0)n(9)dO (the 
integrated likelihood of the model). The idea is to simulate from the marginal distribution p(u), 
and then draw from p(9\u), to get draws from p(0) = n(9\y). 

How does one simulate from p{u)l Since p{u) is a monotonic, non-increasing function on [0,1], 
to sample from p(u), Walker et al. approximate a continuous function ^(w) = ipp(u)/c at the 
discrete points {i/ N : i = 1, . . . , N}, with a Bernstein polynomial, (f(w), for some arbitrarily large 
N. 

N w 

Note that q[u) is the probability that O(0) > u for a draw from the proposal distribution g{9). 
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To approximate the q(i/N) term in[I| one can repeatedly draw from g(6) M times, and count, so 

M 

q M (i/N) = M- 1 £ > »/N). (5) 

Equation [I] is a mixture of beta distributions, with cjM(i/N) as the weight for each i. Hence, to 
simulate from p(u), draw from a beta distribution with parameters i + 1 and N — i + 1, choosing 
z' with probability proportional to q^i/N). Walker et al. show that this mixture approximation 
can be made more accurate by judiciously choosing N and M. A large M corresponds to a more 
accurate approximation to q(i/N), and a large N corresponds to a more accurate approximation 
to p(u), given the approximation to q(-)- Under these assumptions, we could, for each desired 
draw from n(9\y), take a single draw of u from the beta mixture, and then repeatedly draw from 
g(6) until > u. We would then repeat this process for as many draws from the posterior as 
we need. To be clear, there are two sets of proposal draws from g(9): taking M proposal draws 
for the purpose of estimating qw(i/N) for all i, and the candidate posterior samples for each 
draw in the rejection sampling phase. 

The approximation for p(u), described above, suffers from a troublesome numerical problem that 
will cause it to fail in high-dimensional hierarchical models. For a given N, the "lowest" beta 
density component in|4]has parameters 1 and N + 1. As an example, suppose that N = 50000. 
The 10~ 6 quantile for that beta distribution is exp(— 24.6). For complex, nonconjugate models, 
for which the parameter count enters the thousands, it is highly unlikely that logO(0) would 
ever exceed -24.6. If this is the case, GDS would never generate an accepted draw. 

Our proposed remedy is to use Bernstein polynomials to approximate a function that is equiv- 
alent to q(u), but that lets us evaluate the Bernstein coefficients (the mixture weights) at points 
that are spaced on the logarithmic, instead of linear, scale. To do this, we evaluate logO(0) 
for all of the M proposal draws of 6, and define Z as the lowest (most negative) logO(0) from 
these proposals. We treat Z as if it were, for all practical puposes, a lower bound on log<3>(0), 
although, of course, it may not be in a formal sense. Next, define v = log(w)/|Z| + 1. It is easy 
to show that not only is < v < 1, but also that if O(0) > u, then log(O(0))/|Z| + 1 > v, and 
thus, that q(u) = q(v). So our strategy is to use the Bernstein polynomial to approximate q(v) 
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(instead of q(u)), draw v from the implicit mixture of beta distributions, and transform back so 
that logw = \Z\(1 — v). This gives us draws of u that are of the same order of magnitude as 
0(0), and leads to a GDS sampler that works in high dimensional problems. There are mono- 
tonic transformations other than the logarithm that one might consider, but we will stick with 
the logarithm to keep things simple. 

To summarize the GDS algorithm: 

1. Find the posterior mode of n(6\y), 9* and compute the unnormalized posterior density 
c = Tc(6*\y) at that mode. 

2. Choose a distribution g(9) so that its mode is also at 9*, and let k = g(9*). 

3. Sample 9 1 ,...,9 M independently from g{9). Compute <Z>(6 m ) = f^V\ e m) n { e m) k_ foj . 
these proposal draws. If O(0 m ) > 1 for any of these draws, repeat Step 2 and choose 
another proposal distribution for which < O(0 m ) < 1 will hold. 

4. Let Z be the minimum logO(0 m ) from the proposal draws. Evaluate 

,M(,7N) = M-'El( !5 ^^+l>^) (6) 

for i = 1, . . . , N. By definition, qM will be monotonic, with ^m(O) = 1 and (Jm(1) = 0. (Later 
we discuss that q((N — 1)/N) >0is not a necessary condition, but, in practice, it better 
enables the method to work in high dimensions.) 

5. Sample v by sampling from beta(z' + 1, N — i + 1), choosing i with probability proportional 
to qM(i/N). Compute logu = Z(l — v). 

6. Sample 9 from g{9) until 0(0) > u. Consider this first accepted draw to be a single draw 
from the target posterior n(9\y). 

7. Repeat steps 5 and 6 until a sufficient number of draws are collected. 

Making a wise choice for g(9) turns out to be the critical part of this algorithm. In practice we 
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found that a multivariate normal proposal distribution with mean at 9* and covariance matrix 
of the inverse Hessian, multiplied by a sufficiently large constant s, works well. We did consider 
some other methods of selecting g(9) that proved less successful. For example, since c/k is a ratio 
of constants, one might be tempted to find the 6 that maximizes f{y\9)n(9)/g(9), and replace 
c/k with the value of this objective function at the minimizer. The problem with this approach is 
that this objective function may not always be bounded (for example, in the case of heavy tails), 
and is likely multimodal. Our proposed "mode-matching and Hessian-rescaling" strategy has 
the advantage of requiring only a single optimization step, namely finding the posterior mode. 
Incidentally, searching for the posterior mode is considered, in general, to be "good practice" 
for Bayesian inference even when using MCMC; see Step 1 of the "Recommended Strategy for 
Posterior Simulation" in Section 11.10 of Gelman et. al (2003). Therefore, we do not consider 
this optimization step to involve much incremental effort over what we might already consider 
doing for an MCMC estimation. But the choice of s for Gaussian proposals, or the choice of 
other functional forms for g(9), requires some thoughtful consideration. Specifically, if the target 
posterior has heavier tails than the scaled g(9), there is some probability that > 1 for some 
draws in the tails. We defer discussion of these points to Section |4j 

2.1 Advantages of GDS 

The primary advantages of GDS are that (a) independent samples from the target density are 
obtained, possibly in parallel; and that (b) it is straightforward to implement. Other than the 
computer code that is needed to implement the steps of the GDS algorithm, all that is required 
are: 

1. a function that computes the log of the unnormalized target posterior density; 

2. functions to sample from, and to compute the density of, g{9), which is typically a standard, 
known density; and, optionally, 

3. a function to compute the gradient (first derivative) of the log posterior; this is needed only 
when using a mode-finding algorithm that exploits gradient information, or, when using a 
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multivariate normal proposal distribution, to estimate the Hessian matrix at the mode. 

The log posterior density is additive across the components of the hierarchical model; i.e., 
log n(9\y) = log f(y\9) + log 7r(0|-) + ... This makes it easy to make changes to the model 
specification. To try a different prior, just change the function that computes that one additive 
term. In contrast, Gibbs sampling requires combining terms that have common parameters into 
conditional posterior distributions. Whether or not model terms possess conditional conjugacy 
has a direct effect on the simplicity of the implementation of the sampler. Thus, researchers 
are inclined to choose certain priors over others for computational convenience even if they are 
not entirely appropriate for the problem. Without conditional conjugacy, the researcher needs 
another way to draw from each conditional posterior, and there are few options that are both ef- 
ficient and easy to estimate in high dimensions. As examples, consider the difficulty in adapting 
the scale parameter of a random walk Metropolis-within-Gibbs update, or computing high-order 
derivatives for Hamiltonian-like MCMC alternatives as in Girolami and Calderhead (2011). Thus, 
there is quite a bit of work facing the researcher who wants to build an efficient MCMC sampler, 
especially when conditional conjugacy is not present, or when she wants to make changes to 
the model. One key example is finding more flexible or informative alternatives to the inverse 
Wishart distribution as a prior on covariance parameters (Gelman 2006). 

2.2 Relationship to other methods 

In many problems, GDS is an attractive alternative to MCMC because the samples from the 
posterior distribution are independent, hence fewer draws are needed than an MCMC algorithm 
to achieve an equivalent effective sample size. Additionally, the draws can be collected in parallel. 
The Walker et al. method of Direct Sampling also has these characteristics. However, there are 
concerns that DS will not work for many common and important models. Most importantly, 
sampling from the prior may be inefficient if the prior is inconsistent with the data, or if the 
prior is much more diffuse than the posterior. Walker et al. propose to mitigate the problem of 
sampling from the prior by factoring the posterior distribution differently; for instance, treating 
some terms in the likelihood as being part of the prior. Alas, there is no guarantee that such 
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a factorization is possible for many models, and without conditional conjugacy the ability to 
execute such a factorization is unlikely. Another approach to improving the efficiency of DS 
might be to simply choose 7i(9) to be more consistent with the data, and thus increase the 
acceptance probabilities of draws from the prior. But this approach means modifying the prior 
based on initial posterior estimates, and violates the basic tenet of separating the choice of model 
from the details of the estimation method. 

Thus, MCMC dominates DS on one important dimension: its broadly general applicability. No 
matter how complicated the model, it is usually possible to construct some kind of sampling 
algorithm, such as full-dimensional Metropolis-Hastings, blockwise Gibbs sampling, Metropolis- 
within-Gibbs, Hamiltonian Monte Carlo, slice sampling, etc. In this regard, MCMC sampling 
is often the "first choice" method for researchers who deal with large, hierarchical models even 
though efficient implementation may not be easy to accomplish. Likewise, the Propp and Wil- 
son (1996) perfect sampling approach could be attempted, but this method is non-trivial to im- 
plement, especially in large dimensions. Besides, this approach depends on drawing from a 
stationary Markov chain, whereas the GDS method is completely non-Markovian. 

The key advantage to GDS is that it can exploit parallel computing technology. Some researchers 
have investigated alternative approaches for MCMC-based Bayesian inference that also take ad- 
vantage of parallel computation; see, for example, Suchard et al. (2010). One notable example 
is a parallel implementation of a multivariate slice sampler (MSS); see Tibbits et. al. (2010). The 
benefits of parallelizing the MSS come from parallel evaluation of the target density at each of 
the vertices of the multivariate slice, and from more efficient use of resources to execute linear 
algebra operations (e.g, Cholesky decompositions). But the MSS itself remains a Markovian al- 
gorithm, and thus will still generate dependent draws. Using parallel technology to generate a 
single draw from a distribution is not the same as generating all of the required draws them- 
selves in parallel. On the other hand, the sampling steps of GDS can be run in their entirety in 
parallel. Note that Step 6 of the GDS algorithm is the usual density to sample at each iteration of 
a Gibbs sampler if one were deploying a MSS, so GDS will result in a non-Markov sequence of 
samples at no more than the computational price as the Markov sequence appearing in Tibbits 
et al. (2010). 
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3 Illustrative Analysis 



We now provide some examples of GDS in action, especially on problems for which MCMC 
fails, or for which the dimensionality, model structure, and sample size make MCMC methods 
unattractive. 

3.1 A Hierarchical Non-Gaussian Linear Model 

Consider this motivating example of a linear hierarchical model discussed by Papaspiliopoulous 
and Roberts (2008); they provide excellent insights on why MCMC fails even in this deceptively 
simple illustration. 

Y = X + ei (7) 
X = + e 2 (8) 

where each Y is an observation, each X is the latent mean for the prior on Y, and e\ and e 2 are 

random error terms, each with mean 0. Papaspiliopoulous and Roberts note that to improve the 

robustness of inference on X to outliers of Y, it is common to model e\ as having heavier tails 

than €2 . Let e\ ~ Cauchy(0, 1), er ~ N(0, 5), and ~ N(0, 50000), and suppose there is only one 

observation available, Y = 0. We plot the posterior joint distribution of X and in Figure [TJ the 

contours represent the logs of the computed posterior densities. We see that around the mode, 

X and appear uncorrelated, but in the tails they are highly dependent. Papaspiliopoulous 

and Roberts show that the Gibbs sampler performs extremely poorly in this case. Indeed, they 

note that almost all diagnostic tests will erroneously conclude that the chain has converged. The 

reason for this failure is that the MCMC chains are attracted to, and get "stuck" in, the modal 

region where the variables are uncorrelated. Once the chain enters the tails, where the variables 

are more correlated, the chains moves slowly, or not at all. 

1 Liberal use of the R statistical software package (R Core Development Team 2011), and the GNU Scientific Library 
(Galassi et. al. 2009) was made during the course of this research. All graphs and plots in this paper were generated 
using the ggplot2 R package (Wickham 2009). 
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Figure 1: Contours of the "true" posterior distribution of the non-Gaussian linear model example. 
To start GDS, the posterior mode, and Hessian of the log posterior at the mode, are 9* = (0,0) 



Our GDS proposal distribution g(6) is a bivariate normal with mean 8* and covariance — sH _1 , 
with s = 37. This scaling factor was the smallest value of s for which < 1 for all M = 20, 000 
of the proposal draws. After setting N = 30, 000, we followed the GDS algorithm to collect 50,000 
independent posterior draws of X and 0. 



posterior density. We see that GDS not only picks up the correct shape of the regions of high 
posterior mass near the origin, but also the dependence in the long tails. As a point of contrast, 
consider Figure |3j which shows, on axes of the same scale, the estimated posterior that we 
would have got with a Laplace approximation by collecting all samples from a MVN(0, — H _1 ) 
distribution. Not only is this approximation more concentrated around the mode, but it fails to 
capture any of the tail dependence in the true distribution. 



and 




In Figure [2j we plot each of the GDS draws, where darker colors represent higher values of log 
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Figure 2: Posterior draws from non-Gaussian linear regression example, using GDS. Darker colors repre- 
sent regions of higher posterior density 
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Figure 3: Posterior draws from non-Gaussian linear regression example, using a Laplace approximation. 
Darker colors represent regions of higher posterior density 
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Attempts 1 2 3 456789 10+ 

Samples 48126 1362 289 108 54 17 15 5 4 20~ 

Table 1: Number of attempts for each sample from the posterior distribution in the non-Gaussian linear 
hierarchical model example. Maximum number of attempts is 624. 

This contrast also helps us understand why, in this case, we needed to scale the Hessian by such 
a large factor to get a valid proposal distribution. The tails of the multivariate normal are much 
thinner than that of the target posterior, and so without a diffuse proposal distribution it is likely 
that > 1 for many of the proposal draws. But we also recognize that the weights in the q 
function are rescaled with respect to the lowest from among the proposal draws. One might 
be concerned that "over-scaling" would make it too easy for a bad proposal draw to be accepted 
into the posterior. But GDS corrects for this possibility by transforming the acceptance threshold 
in the same way, hence GDS will whittle away at the proposal distribution, so that only samples 
from the posterior distribution remain. Table [T] presents the number of attempts required for each 
of the 50,000 posterior draws. We see that the vast majority of proposal samples are accepted on 
the first attempt, but that there are many draws for which it takes many attempts before getting 
an acceptance. Of course, these draws correspond to draws with higher values of the acceptance 
threshold u. Our point here is that GDS remains selective when accepting proposal draws, even 
though the acceptance threshold for some, but not all, of the draws may be low. 

Finally, we note that the sequential Monte Carlo algorithm also works well for this example 
(Carvalho et al. 2010). 

3.2 The Probit Regression Model 

Although there is no generally agreed upon "gold standard" that one could use to compare GDS 
with any other estimation method, we are typically interested in the similarities of the estimates 
of the posterior distributions, Monte Carlo variance of the estimates of the posterior means, and 
the computational resources required to achieve comparable effective sample sizes. To get a sense 
of how well GDS compares to MCMC along these dimensions, we conducted a simulation study 
for the binary probit regression model using the same parameters as in Walker et al. (2011), and 
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which consists of 25, 000 samples. 

In a probit model, for individuals i = 1 . . . H, we observe a binary outcome Z; G (0, 1 ) . Whether 
or not an individual's outcome is or 1 depends on an unobserved latent variable y, = X\Q + £,, 
where X\ is vector of observed covariates for person i, 9 is a vector of coefficients, and e, ~ N(0, 1) 
is a source of random variation. The data likelihood is 



H 



n(6\z 1:H ,x 1:H ) oc 7r(0)fj 



i=i 



I(yi > 0)I(zi = 1) + % < 0)7(z ; = 0) • N(y;|* f 0,l) 



(9) 



Details on the MCMC sampling algorithm for the probit model are found in Albert and Chib 
(1993), and our implementation of the algorithm is from the MCMCpack R package (Martin et. 
al. 2011). A Generalized Gibbs Sampler method was developed by Liu and Sabbati (2000) that 
does better than the Albert and Chib algorithm. But as we noted before, for any model one 
could always come up with a better Gibbs sampling method, or a Metropolis-Hastings algorithm. 
Indeed, for this illustration the Albert and Chib method is quite efficient if we worked with a 
reparametrized form of the model, as discussed in Nandram and Chen (1996). But in the types 
of models we have in mind, reparametrizations may not be obvious at all. And since many 
statistical practitioners may gravitate to computer code that is readily available, the Albert and 
Chib sampler seems like an appropriate illustrative benchmark. 

We started the MCMC sampler at the posterior mode, and generated 1,000,000 draws; this took 
67 minutes on a recent vintage Apple Mac Pro with two Intel Xeon X5670 processors and 32GB of 
RAM. Time series plots for these draws are in Figure |4j In Table [2] we show the effective sample 
sizes for these 1,000,000 draws for each of the 10 parameters, as well as results of a Raftery-Lewis 
diagnostic test (Raftery and Lewis 1996) for determining the number of iterations required to 
estimate the posterior median within 0.01 with probability 0.95. These statistics were computed 
using the coda R package (Plummer et. al., 2010). Also shown are the lag-250 autocorrelations 
for the 10 parameters. 

The results in Table [2] should be somewhat striking to a reader expecting that 1 million draws 
should be sufficient to estimate a 10-parameter binary probit model. But we see that the effective 
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Effective 
Sample Size 




Burn-In (M) 


Raftery-Lewis Diagnostics 
Required Draws (N) Scale Factor (I) 


ACF-250 


1 


11585 




296 


971916 


101 


.173 


2 


256 




27324 


82969920 


8640 


.938 


3 


3296 




4998 


15474522 


1610 


.542 


4 


916 




13923 


41022072 


4270 


.814 


5 


105 




33000 


100016400 


10400 


.971 


6 


17 




30728 


94116552 


9800 


.991 


7 


241 




22300 


68337012 


7120 


.940 


8 


165 




22743 


69260738 


7210 


.957 


9 


133 




39783 


121862928 


12700 


.964 


10 


19731 




115 


382053 


39 


.014 



Table 2: Effective sample sizes, Raftery-Lewis diagnostics, and lag-250 autocorrelations, for the binary 
probit example. Effective sizes are based on 1 million draws. Raftery-Lewis statistics were generated from 
250,000 pilot draws, and represent the number of burn-in and collected iterations needed to estimate the 
posterior median within 0.01 with probability 0.95 for the 10 parameters. The equivalent number of GDS 
independent draws is 9,607. 

sample size for many of the parameters, relative to the total number of draws, is quite low. And 
if we wanted to estimate the posterior median at the precision and confidence suggested by the 
Raftery-Lewis test for all parameters, we would need to run the chain for nearly 122 million 
iterations! Moreover, barring two parameters, the lag-250 autocorrelations are very large. In 
contrast, using GDS, just under 10,000 independent draws from the marginal posteriors for all 
parameters would be sufficient. 

For these draws, the GDS proposal distribution is multivariate normal with a mean at the poste- 
rior mode, and a covariance matrix set to the inverse Hessian at the mode, scaled by 1.04 (again, 
this scale is the smallest factor for which < 1 for all proposal draws). Figure [5] compares the 
posterior distributions of the parameters for both GDS and MCMC, and Figure [6] compares the 
posterior means and sampling errors for both methods. In both figures, the box is the interquar- 
tile range, the crossbar is the posterior mean, and the vertical lines span from the 5th to 95th 
percentiles. The dark horizontal lines at the top and bottom of each panel show the 5th and 95th 
percentiles for the posterior interval that one would expect from a Laplace approximation. We 
see that both methods give comparable estimates, but the point estimates for the posterior means 
are more precise using GDS than MCMC. It is true that these GDS estimates are computed using 
a larger effective sample size, but it is unlikely that, in practice, modelers would (or should) need 
to collect 122 million iterations from any method to do inference on such a low-dimensional 
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Figure 4: Traceplots for Gibbs sampling chains (thinned by 500 only to reduce the size of graphics file) 
for the binary probit regression. 
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Figure 5: Posterior distributions of parameters in the binary probit regression example. 



problem. Given the success of GDS in the above two examples, one must be optimistic, or at 
least curious, about its potential for larger problems to which we now turn. 



3.3 Example of a Complex High-Dimensional Model in Management Science 

In the previous illustrations, we knew the "truth". In reality, one never knows if one has arrived 

at the "true" estimates from posterior distributions, especially in high dimensions involving 

large data sets. These next two illustrative analysis are examples of this fact. As described 

above, MCMC algorithms are most cumbersome to implement when there are a large number 

of correlated parameters, and when posterior conditional distributions are non-conjugate. Here 

we illustrate the use of GDS on a management science example. Manchanda et. al. (2004), 

studied the effectiveness of direct sales visits (known as "detailing" visits) to physicians from 

representatives of a pharmaceutical company]^] Let y !f be the number of prescriptions written by 

physician i in month f for a particular brand of drug; Xj t is the number of detailing visits; and 

2 We present a simplified version of the Manchanda et. al. model, ignoring the effect of lags in i/j t . The full 
Manchanda model is even more highly parameterized, and is neither bounded nor differentiable. 
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Figure 6: Posterior means and Monte Carlo error for the binary probit regression example. 

Zj is a vector of exogenous, time-invariant physician characteristics, such as an intercept, type of 
practice, and total number of prescriptions written. Manchanda, et. al. model y# as a negative 
binomial random variable with shape r, mean ]ij t , and log jin = /3,o + fiiiXu, and X[ t as a Poisson 
distribution with mean A;. This model is concerned with how to capture dependence between 
y. and A. Specifically, the number of detailing visits is not exogenous, since sales representatives 
might be more likely to visit physicians who either write more prescriptions or are more respon- 
sive to the visits themselves. One way to capture this dependence is to assume that the sales 
representatives know each physician's values of fiio and f>\\. Then we can model the expected 
number of detailing visits as log A, = 70 + 71/^0 + 72 fin- The hierarchical model is then specified 
as 



y it ~ NBD(r, p it ), log fi it = j3 f0 + pax it x it ~ Poisson(A ! ), log A; = 70 + 7ifro + 720,1 



log|6 ~ MVN(Az,E) 
7~MVN(0,25-7) 



vecA ~ MVN(0, 1000 • I) 
E" 1 ~ Wishart(6, - • I) 
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Note that the conditional posteriors for /3 and 7 are not standard distributions, so any MCMC- 
based estimation algorithm would likely employ Metropolis-Hastings draws, either in full, or 
in blocks. Also, note that jSp; and appear multiplicatively in the posterior with 71 and 72. 
Thus, we would expect high correlation among these parameters' posterior distributions, high 
autocorrelation in the sampling chains, and, as Manchanda et. al. remark, possibly even two 
modes. 

These are the kinds of models for which we expect GDS to be singularly useful. Prescription 
detailing data are available from the bayesm R package (Rossi, 2011). This dataset contains 
prescription and detailing information for 1,000 physicians over 23 months, as well as three 
pieces of physician-specific data: a dummy indicating if the physician is a general practitioner, a 
dummy indicating if the physician is a specialist in the medical discipline for which the drug is 
intended, and the mean number of free drug samples given to the physician. With two elements 
of each /3 ; -, and four elements of each Z; (including an intercept), A is a 2 x 4 matrix and 
has 3 distinct elements. To ensure that is positive definite, we estimate the non-zero lower 
triangular elements of the Cholesky decomposition of instead of elements of S. Altogether, 
there are 2015 parameters in the model. For the GDS proposal distribution we used a multivariate 
normal, centered at the posterior mode, with a covariance of — sH _1 , with s = 1.27. 

We also estimated the model with MCMC using the full-dimensional adaptive Metropolis Ad- 
justed Langevin Algorithm (MALA, Atchade 2006). This algorithm is widely used, is relatively 
simple to code, and allows us to take advantage of gradient information to accelerate conver- 
gence to the stationary distribution. Since we compute the gradient of the log posterior as part 
of the algorithm to find the posterior mode for GDS, we have computer code for the gradient 
already available. Following the recommendation of Atchade (2006), we tune the algorithm to 
achieve an approximate acceptance rate of 0.574. We considered several other algorithms in ad- 
dition to MALA for our baseline implementation, and we have no doubt that there are other 
MCMC algorithms that might perform better. For example, the Girolami and Calderhead (2011) 
Riemannian MALA and Hamiltonian Monte Carlo algorithm, and others have proposed paral- 
lel algorithms based on running parallel chains, or in particle systems. But these algorithms 
are often considerably more difficult to implement than either GDS or MALA; for instance, the 
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Girolami and Calderhead method requires the computation of third derivative tensors at each 
iteration of the MCMC algorithm, and for their Hamiltonian case, one requires an iterative so- 
lution to a discrete differential equation. So even though MALA does not perform particularly 
well, as we explain next, we were hard-pressed to find any other algorithm that would be nearly 
as easy to implement, and would have any hope of working better than GDS. Our thought is 
that MALA is the kind of MCMC algorithm that a skilled applied statistician would implement 
in practice, and therefore serves as an appropriate point of comparison. 

We started the MALA algorithm at the posterior mode, ran it for two days to pass through 
one million burn-in draws, and then collected 2.5 million draws during the next five days. The 
traceplots and autocorrelation functions are in Figures [7] and |8} respectively. It is clear that MALA 
does not perform at an acceptable level for any practical purpose, since the sampling chains for 
many parameters have not achieved convergence even after a full week of computation. This 



might not necessarily be a big problem, since we saw in Section 3.2 that even though the effective 
sample size from the Gibbs sampler was very small, the estimates of the posterior distributions 
were comparable to both the Laplace approximation and GDS; also, the precision of the posterior 
mean estimates were conservative. 

In Figure |9j we present box plots of the marginal posterior intervals for the 15 population- 
level parameters of the model, estimated using MCMC and GDS. In all cases, the posterior 
intervals generated by GDS are wider than those generated by MCMC. For some parameters, 
the posterior intervals from the two methods are reasonably aligned. But for others, the MCMC 
intervals degenerate to almost a single point. This is because the MCMC chains are stuck in 
regions of high posterior density, and that some parameters are so highly correlated that the 
chain cannot escape from that spot. This is the same story as in the Papaspiliopoulous and 



Roberts example in Section 3.1 Correlation across parameters can occur when there is little 
information to identify parameters separately. In fact, Manchanda, et. al. note that because ftoi 
and f>\\ enter the model multiplicatively the posterior could even be bimodal in the absence of 
sufficient information about the effect of detailing on prescriptions. Even with the unimodal 
posterior we have here, it makes sense that f5y, but not f>Qi, would be correlated with 7, since 7 
determines The 7 parameters would also be correlated with £2,2/ (the variance of f>n across 
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Figure 7: Traceplots for MALA estimation of the population-level parameters in the drug detailing 
pie. 
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Figure 8: Autocorrelation plot for the population-level parameters from MALA estimation of the drug 
detailing example. 
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Attempts 1 2 3 456789 10+ 
Samples 18540 802 263 125 72 44 28 21 23 82~ 

Table 3: Number of attempts for each sample from posterior distribution in the drug detailing example. 
The maximum number of attempts is 695. 

the population) and Ei 2 (the population covariance of /3 o; - and fin). Given all these relationships 
among certain parameters, it defies credibility that our estimates of these parameters could be 
as precise as the MCMC output would suggest. As further evidence that the GDS intervals are 
more reasonable than the MCMC intervals, consider the black lines in the panels in Figure [9] 
The distance between the lines is the 90% confidence interval that would be implied by a simple 
Laplace approximation of the posterior. Although there is no reason to expect that the marginal 
posteriors of these parameters should be Gaussian, we are increasingly confident that the GDS 
intervals more likely represent the "true" posterior distribution. This is particularly relevant 
when viewed through the spectrum of the previous two examples. 

Table [3] summarizes the number of proposal attempts from each GDS draw from the posterior. 
The vast majority of samples from the posterior were collected on the first attempt; this is not 
surprising since we expect the posterior to converge to a Gaussian distribution anyway with a 
large number of observations. But we also see that GDS does, in fact, reject many of the proposals, 
suggesting that the posterior distribution is not exactly the same as the proposal distribution. 

To gain some more intuition into how the GDS algorithm decides which draws to accept and 



which to reject, we constructed Figure 11 In this figure, the x-axis indexes the 20,000 samples 
that we want to collect from the target posterior distribution. The y-axis represents values of 
logO(0). The solid curve represents the logw from Step 5 of the GDS algorithm in Section [2j 
ordered from lowest to highest. (The samples from the posterior are exchangeable.) Recall that 
u is sampled from a mixture of beta distributions, with the mixture weights determined by q{v), 



which we show in Figure 10 We then simulated 20,000 draws from the proposal distribution, 
and plotted the logO(0) for each draw on the same y-axis. During the sampling phase of the 
GDS algorithm (Step 6), any proposal draw above the curve would be retained as a draw from the 
posterior, while draws below would be rejected. Note that the y-intercept of the curve is always 
Z, the lowest logO(0) among the proposal draws. But even though the acceptance threshold 
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;ure 9: Posterior intervals for the drug detailing example, as estimated using MCMC and GDS. 
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Figure 10: q function for drug detailing example. 

may be low for some of the targeted draws, many of the corresponding proposals will have 
much higher values of log <3?(#). Hence, GDS permits some draws from the tails of distributions, 
as long as there are draws from the proposal distribution in the posterior tails, but ensures that 
a sufficient number of draws are collected from regions of higher posterior mass. And indeed, 



when we plot the accepted posterior samples against the log posterior densities, as in Figure 12 
we see that there are a number of draws that are concentrated in those high posterior regions. 

To be sure that the accepted draws were sampled from a distribution that is different from the 
proposal distribution, we conducted a Kolmogorov-Smirnov test for all 15 of the population-level 
parameters, against the null hypothesis of being sampled from a Gaussian distribution. The null 
hypothesis was soundly rejected in all cases. 



3.4 An online advertising example of a complex, 30, 000-dimensional model 

In this section, we consider a model for the effectiveness of online advertising campaigns. For 
4,643 anonymous users, the dataset records which ads (if any) from an advertiser were served 
to each user during the course of that user's web browsing activity when these users visited the 
advertisers own website (if ever), and if these website visits resulted in a "successful" visit. Data 
for ad exposures, site visits, and successes were collected weekly. The managerial objective is to 
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Figure 11: The curve is the acceptance threshold for samples from the posterior, ordered from lowest to 
highest. Dots are the log posterior for draws from the proposal distribution. Proposal draws above the 
curve would be accepted, and those below would be rejected. 




Figure 12: Log posterior densities of 20,000 accepted samples from the GDS algorithm. 
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develop a method for firms to identify which versions of ads are most likely to generate site visits 
and sales, taking into account the fact that the return on investment of the ad may not occur until 
several weeks in the future. The model allows each version of an ad to have a contemporaneous 
effect in that week, but each repeat view of the same ad has an incrementally smaller effect. The 
effect of the ad campaign for an individual builds up with each subsequent ad impression, but 
this accumulated "ad stock" decays from week to week. Understanding these relationships is 
useful because if some ad versions are initially ineffective, or become less effective over time, the 
firms can efficiently allocate marketing resources for other activities, and reduce the "clutter" of 
ineffective advertising that many users endure. 

To better understand this behavior, we construct the following hierarchical model. Let n^, v^t 
and Sh t be, respectively, the total number of ad impressions, website visits, and sales conversions 
for user h in week t. For the user-level likelihood, follows a zero-inflated Poisson distribu- 
tion with user-specific rate A/,, follows a zero-inflated Poisson distribution with user-specific 
rate ji^, and follows a zero-inflated binomial distribution with each of the visits having a 
user-specific probability pht of resulting in a success. Furthermore, let Ef,t be the total contempo- 
raneous effect of the ad campaign on user h in week t. This effect depends on which of each of 
the versions of ads that the user sees. Define Cy as a coefficient that captures the relative effect 
of ad version on the overall ad effect, let 5 be an "ad wear-out" parameter that captures di- 
minishing marginal effects for each version, and let 8 be a "restoration" parameter that captures 
the recovery of the ad's effectiveness as T/, t weeks elapse since that user's last viewing of the ad. 
Define Yu] as the cumulative number of impressions of ad version that the user has received 
by time t, and let a. be the week-to-week decay parameter for the time-varying accumulated ad 
stock A}, t . Putting all of this together, 

Y hjt t 

Eht = E C iL 3 " 1 ( 1 + ) T " t A ht =^cc t -% t (10) 

;' i'=l fc=l 

The ad exposure rates, visit rates and success probabilities depend on user-specific baseline 
propensities, A/, , and pho, along with the accumulated (and decayed) ad stock. The im- 
pression rate A^o is time-invariant, but the visit rate and success probabilities jiht and Pht vary 
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according to ad stock, heterogeneous coefficients /3/,„ and j6/, p/ and two seasonality dummies in 

x t 

log ]iu = log Una + hv A ht + 7 x t logit p ht = log pw + hv A ^- (H) 

We then define ^ = {logA/„ logf//,o, logit p^O/ j^, /S^} as a vector of heterogeneous param- 
eters that are distributed across the population according to a multivariate normal distribution 
with mean (p and covariance S. We constrain the component of (p that corresponds to the prior 
mean on f}^ to be 1, to facilitate separate identification of the scale of the C/s. These prior 
parameters, along with C\ : i, ot, 5 and 9, are all unknown, with weakly informative hyperpriors. 

With 4,643 individuals, there are 29,051 parameters in this model. Given how much trouble 
MCMC had with our previous three examples, estimating an even more complex model like 
this one via MCMC seems like a lost cause. We expect there to be high correlation between the 
individual-level parameters and the parameters of their priors, as well as among the Cj, a, 5 and 
9. We did try to estimate this model using MCMC, but after nearly two weeks and over two 
million MCMC iterations, the chain had barely moved. 

Although we could have worked to improve our own implementation of MCMC, in light of 
the previous illustrations, we are pessimistic that such an effort would lead anywhere in real 
time. However, in spite of the apparent complexity of the model, it is straightforward to break 
it down into the additive components of the logs of data likelihood, priors and hyperpriors. 
This decomposition lets us construct functions that compute both the log of the posterior, and 
its gradient, for each component separately, and then add them up. One can then draw from 
the vast existing library of nonlinear optimization algorithms to find the posterior mode; see 
Appendix. This lets us run GDS, with a multivariate normal proposal distribution, centered at 
the posterior mode. The covariance matrix is the inverse Hessian at the mode, scaled by 1.02. 

Table |4]shows the "draws to acceptance" counts for 2,000 samples from the posterior distribution. 
As above the acceptance rates are quite high, even for a model with nearly 30,000 parameters. 
GDS will still carve away at the proposal distribution, leaving the target posterior distribution 
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Attempts 1 2 3456789 10+ 



Samples 1719 146 46 23 15 14 4 4 3 26 



Table 4: Number of attempts for each sample from posterior distribution for the advertising effectiveness 
model. Maximum number of attempts is 296. 



-200 - 



-400 - 




-600 - 



-800 



1000 



2000 
Index 



3000 



4000 5000 



Figure 13: Proposal draws (dots) and acceptance thresholds (curve) for the advertising example. Accep- 
tance thresholds are ordered from lowest to highest. Proposal draws above the curve would be accepted, 
and those below would be rejected. 



behind. Figure [13] shows the scatters of the M = 5000 proposal draws against their values for 
and plots the associated distribution of log u. As before, proposal draws that fall below the 
logu curve would be rejected, and those above would be accepted. Figure [14| plots log posterior 
densities of the final, accepted proposal draws, ordered along the x axis according to the draw of 
u used to determine acceptance or rejection. 

We concede that no one can "prove" that the GDS parameter estimates are the "true" ones in 
these types of high-dimensional models. This is true of any estimation method. But, in conjunc- 
tion with the other examples that we presented in this paper, it gives us confidence that GDS is a 
credible, viable estimation algorithm for large scale Bayesian hierarchical models. Nevertheless, 
we performed a posterior predictive check of the model, using a holdout sample of an addi- 



tional 1,160 individuals. Figure 15 shows the actual number (the dot) and posterior predictive 
distributions (the boxplots) of visits and successes for overlapping subsets of the holdout sample. 
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Figure 14: Log posterior densities of 2,000 accepted samples from the GDS algorithm for the advertising 
response example 

Clearly, this model, using parameters that we estimated by GDS, generates data that conforms to 
what was actually observed. 



4 Practical considerations and limitations of GDS 



This section details some of our experiences while implementing GDS. Like the entire body of 
MCMC research taught and continues to teach us, more insights will likely be gleaned as we, 
and hopefully others, gain additional experience with the GDS method. 



4.1 Finding the posterior mode and estimating the Hessian 



For small problems like the examples in Sections 3.1 and 3.2 standard nonlinear optimization 
algorithms, such as those found in common statistical packages like R, are sufficient for finding 
posterior modes and estimating Hessians, even when the gradients and Hessian are estimated 
numerically. For larger problems, these steps in the GDS algorithm could, potentially, be tricky. 
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Figure 15: Posterior predictive distributions, and observed values, for the total number of visits and 
successes for overlapping subsets of customers in a holdout sample from the advertising example. 



Nocedal and Wright (2006) describe many of these algorithms, and discuss cases in which some 
might work better than others. Our most successful implementation of a trust region optimizer is 
the Steihaug (1983) conjugate gradient algorithm. Nocedal and Wright describe how to combine 
the Steihaug algorithm with symmetric rank-one quasi-Newton estimates of the Hessian. In 
general, this approach worked well for us. 

Estimating the Hessian directly comes into play twice in the GDS algorithm: in the process of 
searching for the posterior mode, and in determining the covariance of a proposal distribution 
that is centered at that mode. The most efficient way to compute the Hessian is, obviously, to 
derive it analytically. However, Hessians can be complicated, and the coding and debugging time 
substantial. For that reason, we want to be able to estimate Hessians using either quasi-Newton 
methods (for the optimization part), or finite differencing once the mode is located. Under con- 
ditional independence assumptions, Hessians are typically sparse, and Powell and Toint (1979) 
explain how to efficiently estimate sparse Hessians using finite differencing and graph coloring 
techniques. Coleman et al. (1985a, 1985b) have written publicly available FORTRAN code to do 
it (the DSSM and FDHS functions in Algorithm 636). As an example, using DSSM and FDHS to 
estimate the 2015 x 2015 Hessian in Section [33] required only 17 finite differencing operations. 
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4.2 Multimodal distributions 



GDS does require finding the global posterior mode, and so far we have considered only models 



with unimodal posterior distributions. In Section 3.3 we mentioned that the Manchanda et. al. 
model can generate bimodal posterior distributions where there is little information in the dataset 
about the effect of detailing on prescriptions. When the posterior is multimodal, GDS provides a 
straightforward way to sample from it: use a multimodal proposal. The idea is to not only find 
the global mode, but any local ones as well, and center each component of a finite mixture at 
each local mode. The GDS algorithm itself is unchanged, as long as the global posterior mode 
matches the global proposal mode. In the interests of space, we omit details on estimating the 
Manchanda et. al. model using a mixture proposal distribution. 

One possible criticism of using GDS for multimodal posteriors is that finding all of the local 
modes could be a hard problem. We agree; there is no guarantee that any mode-finding algorithm 
will find all modes of a posterior distribution. Fortunately, the nonlinear optimization literature 
is rife with methods that help facilitate efficient location of multiple modes, even if there is no 
guarantee of finding all of them. Also, note that even though MCMC sampling chains may, in 
theory be guaranteed to explore the entire space of any posterior distribution (including multiple 
regions of high posterior mass), there is no guarantee that this will happen after a large finite 
number of iterations for general nonconjugate hierarchical models; indeed as we saw in earlier 
sections, even in low dimensions, there is no guarantee that an MCMC chain will converge. 
Other estimation algorithms that purport to be robust to multimodal posteriors offer no such 
guarantees either. 



4.3 Proposal distribution, acceptance rates, adaptive GDS, and selecting s 

In the GDS method, choosing g(6) so that its mode matches that of f{y\Q)n(9) does not guarantee 
that g(6) is a valid proposal distribution. If, for any 9, > 1, then the Bernstein polynomial 

c]m(') is not approximating a monotonic function on (0, 1), and thus simulating from the resulting 
mixture of beta distributions is not equivalent to drawing from p(u). So if q{\) > 0, then we need 
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to start over with a new proposal distribution. Fortunately, the cost of an adaptive approach to 
finding a valid g(9) is small. To start, choose a g(6) function and see if q{\) > 0. If so, refine 
g(9) until ^(1) < 0. For example, if g(9) is multivariate normal, we might iteratively increase the 
scale of the covariance matrix until q(l) =0. 

One might think that a way around the ^m(I) < restriction would be to scale the variance of the 
proposal distribution to be so high that no draw could ever lead to ^m(I) > 0- The problem with 
this tactic is that if the scale of the proposal is much higher than necessary, the Z against which 
the auxiliary variable u is transformed is lower than necessary. This could, in theory, make it too 
easy to accept low probability draws into the posterior. In practice, the proposal is over-scaled 
qu{i/N) = for too low of an i, and we can observe this before starting the rejection sampling 
phase of the algorithm. When this occurs, the q function is approximating a function that is 
bounded from above by something less than one, even though the approximation itself is on 
the interval (0, 1 ) . A heuristic analogy to importance sampling would be sampling the auxiliary 
variable u from a truncated uniform distribution, instead of a uniform over (0, 1 ) . Thus, not only 
should g{9) be adapted such that ^(1) = 0, but also that q((N - 1)/N) > 0. 

This initial adaptation could be completed quickly if one starts with a moderate value for M 
(say, M = 1000), and finds a range of specifications for g{9) that work. Then, one can increase 
M until it is sufficiently large to give a good approximation to q(-). As long as ^(1) = and 
(f((N — 1)/N) > 0, GDS provides valid draws from the target posterior. But even if q(\) = in 
the proposal stage, based on a finite number of proposal draws from g(9), there is no guarantee 
that < 1 for all proposals in the rejection sampling stage, especially if the tails are heavier 
for the target posterior distribution than for the proposal distribution. If > 1 were to occur, 
but rarely, we suggest just accepting the draw and moving on. This draw is probably in an area of 
high posterior mass, relative to the proposal density. Additional research is needed to develop a 
stronger theoretical justification for these recommendations, and to better understand the extend 
of any error that might be introduced by poor choices for the proposal distribution. Incidentally, 
just because the scaling factor is large does not mean that it is too large, as we saw in the example 



in Section 3.1 Also, in our experience, any theoretical deviations that one might encounter in 



practice are dwarfed by common shortcuts used in MCMC estimation. 
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4.4 Cases requiring further research 



We have demonstrated that GDS offers a viable alternative to MCMC for a large class of Bayesian 
non-Gaussian and Gaussian hierarchical models. We do not claim that GDS is appropriate for 
all models, nor do we claim that GDS would not work for any of the models we describe in this 
section. But it is not apparent how best to implement GDS for these models. 



Models with combinatorial optimization elements In models that include both discrete and 
continuous elements, finding the posterior mode becomes a mixed-integer nonlinear program 
(MINLP). An example is the Bayesian variable selection problem (George and McCulloch 1997). 
The problem is that MINLPs are known to be NP-complete, and thus may not scale well for large 
problems. Hidden Markov models with multiple discrete states might be similarly difficult to 
estimate using GDS. 



Intractable likelihoods or posteriors There are many popular models, namely binary, ordered 
and multinomial probit models, for which the likelihood of the observed data is not available 
in closed form. When direct numerical approximations to these likelihoods (e.g., Monte Carlo 
integration) is not tractable, data augmentation is a popular tool for estimating these models via 



MCMC. The Albert and Chib approach to the binary probit model in Section 3.2 is one example. 
That said, recent advances in parallelization using graphical processing units (GPUs) might make 
numerical estimation of integrals over regions that define probabilities more practical than it was 
even 10 years ago; see Suchard et al. (2010). If this is the case, and the log posterior remains 
sufficiently smooth, then GDS could be a viable, efficient alternative to data augmentation in 
these kinds of models. This is especially true if the corresponding MCMC draws are autocorre- 
lated, since GDS would require many fewer estimations of these probabilities. Certain types of 
dynamic structural models (e.g., Iyengar, et. al., 2007) might also fall into this class of problems. 



Missing data problems MCMC -based approaches to multiple imputation of missing data could 
suffer from the same kinds of problems as with multinomial probit: the latent parameter, intro- 
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duced for the data augmentation step, is only weakly identified on its own. Normally we are 
not interested in the missing values themselves. If the number of missing data points is small, 
perhaps we could treat the representation of the missing data points as if they were parameters. 
But the implications of this require additional research. 

Spatial models, and other models with dense Hessians So far in this paper, we have con- 
sidered only those models for which the individual, heterogeneous units were conditionally 
independent. GDS does not require conditional independence, and thus should be appropriate 
for spatial or contagion models (e.g., Yang and Allenby 2003). The only difficulty we foresee with 
these models is when a fast approximation of the Hessian is required. If the Hessian is dense, 
we cannot take advantage of the sparse Hessian tools in Section |4.1| 




5 Conclusions 

In this paper, we developed a new method, Generalized Direct Sampling (GDS), to sample 
from posterior distributions. This method has the potential to completely bypass MCMC -based 
Bayesian inference for large, complex models models with continuous, bounded posterior den- 
sities, even in the absence of conditional conjugacy. In addition, we showed that specifying the 
hierarchical model is much easier to do for GDS than for MCMC, and allows for greater flexibil- 
ity in model specification without sacrificing computational expediency. Finally, unlike MCMC, 
GDS generates independent draws that we can collect in parallel. Of course one could employ 
parallel computational techniques as part of a sequential algorithm. But, to repeat an earlier 
sentence, using parallel technology to generate a single draw is not the same as generating all of 
the required draws themselves in parallel. By exploiting the advantages of parallel computing, 
GDS could prove to be a successful addition to the Bayesian practitioner's computational toolkit. 
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